Orthogonal Transformer: An Efficient Vision Transformer Backbone with Token Orthogonalization

Neural Information Processing Systems

We present a general vision transformer backbone, called Orthogonal Transformer, in pursuit of both efficiency and effectiveness. A major challenge for vision transformers is that self-attention, the key element for capturing long-range dependencies, is very computationally expensive for dense prediction tasks (e.g., object detection). Coarse global self-attention and local self-attention have been designed to reduce this cost, but they suffer from either neglecting local correlations or hurting global modeling. We present an orthogonal self-attention mechanism to alleviate these issues. Specifically, self-attention is computed in an orthogonal space that is reversible to the spatial domain but has much lower resolution.
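The mechanism the abstract describes can be sketched at a high level: map the tokens into an orthogonal space with a reversible (orthogonal) transform, run attention over a lower-resolution slice of that space, then map back with the exact inverse (the transpose). The following NumPy sketch is purely illustrative and not the paper's actual operator; the token count `N`, slice size `M`, and the choice of a single Householder reflection as the orthogonal transform are all assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def householder(u):
    """Householder reflection H = I - 2 uu^T / (u^T u); H is orthogonal."""
    u = u / np.linalg.norm(u)
    return np.eye(len(u)) - 2.0 * np.outer(u, u)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# N spatial tokens of dimension d (illustrative sizes).
N, d = 16, 8
X = rng.standard_normal((N, d))

# Orthogonal transform over the token axis; reversible since T.T @ T = I.
T = householder(rng.standard_normal(N))

# Map tokens to the orthogonal space, attend over only the first M
# "low-resolution" components (cost M^2 instead of N^2), then map back.
M = 4
Z = T @ X                               # tokens in the orthogonal space
Zl = Z[:M]                              # lower-resolution slice
A = softmax(Zl @ Zl.T / np.sqrt(d))     # self-attention weights on the slice
Z_out = Z.copy()
Z_out[:M] = A @ Zl                      # attended slice
X_out = T.T @ Z_out                     # exact inverse back to spatial domain
```

The point of the sketch is the cost structure: attention is quadratic in the number of attended tokens, so attending over `M` components in the reversible space instead of all `N` spatial tokens reduces that quadratic term while the transform itself loses no information.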


Orthogonal Transformer: An Efficient Vision Transformer Backbone with Token Orthogonalization A Proof of Theorem 1

Neural Information Processing Systems

Herein we provide the proof of Theorem 1 in the main text. Proof A.2 constructs a Householder matrix from a vector u; Q is then the product of n − 1 orthogonal Householder matrices. Proof A.5 applies Lemma A.3 to upper-triangularize the given real orthogonal matrix A. We train the models with two common settings. The AdamW optimizer is used with a learning rate of 0.0001, weight decay of 0.05, and a batch size of 16. We apply Orthogonal Transformer pretrained on ImageNet-1K as the backbone network. Fig. I and Fig. II show the detailed architectures of the convolutional patch embedding; the last convolution has a kernel size of 1×1, followed by a LayerNorm layer.
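The decomposition used in Proof A.2 can be checked numerically: a product of Householder reflections is itself orthogonal. The sketch below builds a product of n − 1 reflections from random vectors (the dimension `n` and the random vectors are illustrative, not taken from the paper) and verifies orthogonality.

```python
import numpy as np

rng = np.random.default_rng(1)

def householder(u):
    """Householder reflection H = I - 2 uu^T / (u^T u); H is orthogonal."""
    u = u / np.linalg.norm(u)
    return np.eye(len(u)) - 2.0 * np.outer(u, u)

n = 5
# A product of n - 1 orthogonal Householder matrices is orthogonal,
# since products of orthogonal matrices are orthogonal.
Q = np.eye(n)
for _ in range(n - 1):
    Q = Q @ householder(rng.standard_normal(n))
```

Orthogonality here means Q^T Q = Q Q^T = I, which is exactly the reversibility property the orthogonal self-attention mechanism relies on.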


